Celestin Apprentice 5

home *** CD-ROM | disk | FTP | other *** search

/ Celestin Apprentice 5 / Apprentice-Release5.iso / Source Code / PowerPlant / DMultiStringLocator / About DMultiStringLocator next >

Wrap

Text File | 1996-07-06 | 5KB | 113 lines

About DMultiStringLocator ========================= Written July 6, 1996 by Eric Gundrum <eric@teamworks.com> Preface ------- This document describes the DMultiStringLocator class. It contains these sections: - What is DMultiStringLocator - Implementation Notes - About the Sample Code - Use Restrictions What is DMultiStringLocator --------------------------- DMultiStringLocator is a C++ class to provide fast searching for one or more strings. It is an implementation of a table-based algorithm credited to Aho and Corasick and described in "Practical Algorithms for Programmers" by Andrew Binstock and John Rex (published in 1995 by Addison Wesley). DMultiStringLocator is designed to search for more than one string at a time. Searching for multiple strings works nearly as fast as searching for a single string. Using a table-lookup mechanism eliminates any unnecessary backtracking when a partial match fails, accounting for much of the speed. This algorithm may not be the fastest available, but it should be sufficient for most practical problems. It is general enough to be used when searching for a single string, as well as for more than one in the same target text. Implementation Notes -------------------- The current implementation expects a list of DSearchString objects when the DMultiStringLocator object is instantiated. Internal state tables are built from this list in the constructor. (This allocates memory; about eight times the combined length of the search strings and at least two times the alphabet size.) Changing the list requires constructing a new search object (and its state tables). A DSearchString object, based on PowerPlant's LStr255 class, contains a ReportFound() virtual member function that must be overridden in a derived class. ReportFound() is executed for each found string. (This is a performance bottleneck; keep ReportFound() short and fast.) A parameter to ReportFound() is the position in the string of the last character of the found string. Use this value to locate the found strings after the search phase. A search is initiated with a call to SearchBuffer(), which will search a RAM-based buffer (by walking a pointer through the buffer). Searching more data than will fit in one buffer can be accomplished with repeated calls to SearchBuffer(), but you must deal with a match string that is split over two buffers. To help handle strings split across the buffer, SearchBuffer() returns the maximum number of characters which might belong to a string found at the end of the current buffer. These characters should be included in the next buffer. However, if these characters also include other complete strings, those strings will be found twice. For example, consider searching for 'foobarooba' and 'bar', where the first buffer ends with 'foobaroo'. SearchBuffer() will return 8, indicating that the second buffer should begin 8 characters before the end of the first. However, such as search will find 'bar' twice. This is a design flaw I have not yet resolved. (It is not significant to my use.) I suggest working around it by looking for multiple hits on the same string in a post process phase (by tracking the hit position). Or, handle it in ReportFound() if performance is not an issue. Another limitation of this implementation is that it searches for a byte sequence. It does not support case-insensitive searches. However, that is easily remedied by changing the case of the search strings and input text. DMultiStringLocator currently relies on a few PowerPlant classes, including LList, LListIterator, and LString. With a bit of effort, these classes could be replaced with STL classes. I'll leave that for a future project. About the Sample Code --------------------- The sample code consists of the CSearchWindow class to demonstrate the search engine and a supporting PowerPlant application (PPShell). CSearchWindow builds the search list from user input (in DoSearch()). It then reads the target file into a buffer and performs a search on that buffer (in SearchFile()). If the file contains more data, the process is repeated until all is read. Note that SearchFile() will overlay the buffer ends to a deal with match strings split over two buffers. CSearchString is an important support class used by CSearchWindow. Derived from DSearchString, CSearchString demonstrates how to minimally collect the results of a search with the ReportFound() function. Here ReportFound() is used to count the number of hits. It also tracks the position of found strings to prevent reporting them twice. Use Restrictions ---------------- I place no restrictions on how this code may be used or distributed. I ask only that my copyright statement be retained in the source files of all derivative works. (I put a lot of work into this class and I want people to know who did it.) If you have comments or want to report bugs, send mail to <eric@teamworks.com>.